164 research outputs found

    Mass accumulation rate changes in Chinese loess during MIS 2, and asynchrony with records from Greenland ice cores and North Pacific Ocean sediments during the Last Glacial Maximum

    Sensitivity-corrected quartz optically stimulated luminescence (OSL) dating methods have been widely accepted as a promising tool for the construction of late Pleistocene chronology and mass or dust accumulation rates (MARs or DARs) on the Chinese Loess Plateau (CLP). Many quartz OSL ages covering marine isotope stage (MIS) 2 (equivalent to L1-1 in Chinese loess) have been determined for individual sites within the CLP in the past decade. However, there is still a lack of detailed MAR or DAR reconstruction during MIS 2 across the whole of the CLP. Here, we present detailed MARs determined for eight sites with closely spaced quartz OSL ages covering MIS 2, and relative MARs suggested by a probability density analysis of 159 quartz OSL ages ranging from 30 to 10 ka ago, from 15 sites on the CLP. The results show enhanced dust accumulation during the Last Glacial Maximum (LGM), with particularly rapid dust accumulation from 23 to 19 ka ago (the late LGM). In contrast, MARs determined for the last deglaciation (from 19 to 12 ka ago) are low. The MAR changes during MIS 2 in Chinese loess are mainly controlled by the East Asian winter monsoon (EAWM) intensity, which is forced by Northern Hemisphere ice volume. The MAR changes also indicate that dust accumulation during MIS 2 is generally continuous at millennial time scales on the CLP. Comparison of Asian-sourced aeolian dust MARs in Chinese loess with those preserved in Greenland ice cores and North Pacific Ocean sediments indicates that rapid dust accumulation occurred from 26 to 23 ka ago (the early LGM) in the ice cores and ocean sediments, a difference of several thousand years from the rapid dust accumulation during the late LGM in Chinese loess. This asynchronous timing of enhanced dust accumulation is probably related to both changes in EAWM intensity and changes in the mean position of the axis of the Westerly jet, both of which are greatly influenced by Northern Hemisphere ice volume. This study highlights the possible influence of changes in the mean position of the axis of the Westerly jet on the long-range transport of Asian-sourced dust.
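    As a rough, hedged illustration of how mass accumulation rates can be computed from closely spaced OSL ages, and of how a probability density of many ages can serve as a relative accumulation proxy, the sketch below uses invented sample depths, ages, age errors, and an assumed dry bulk density; none of these values or variable names come from the study.

```python
import numpy as np

# Hypothetical dated horizons at one site (not data from the study).
depth_m = np.array([2.0, 3.5, 5.0, 6.8])      # sample depth below surface (m)
age_ka = np.array([12.0, 17.5, 21.0, 26.0])   # quartz OSL age (ka)
bulk_density = 1.48e3                          # assumed dry bulk density (kg m^-3)

# MAR between successive dated horizons:
# (thickness x bulk density) / time interval, expressed in g m^-2 yr^-1.
thickness_m = np.diff(depth_m)
interval_yr = np.diff(age_ka) * 1e3
mar_g_m2_yr = thickness_m * bulk_density * 1e3 / interval_yr

# Relative MAR from a probability density of many OSL ages: sum one
# Gaussian kernel per age, using its 1-sigma error as the bandwidth;
# a higher density of ages implies faster dust accumulation.
ages = np.array([11.8, 14.2, 19.5, 20.1, 21.7, 22.4, 23.0, 25.6])   # ka
errs = np.array([0.9, 1.1, 1.3, 1.2, 1.4, 1.5, 1.5, 1.8])           # ka, 1-sigma
t = np.linspace(10, 30, 401)
kernels = np.exp(-0.5 * ((t[:, None] - ages) / errs) ** 2) / (errs * np.sqrt(2 * np.pi))
relative_mar = kernels.sum(axis=1) / len(ages)
```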

    EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis

    There has been significant progress in emotional Text-To-Speech (TTS) synthesis technology in recent years. However, existing methods primarily focus on synthesizing a limited number of emotion types and achieve unsatisfactory performance in intensity control. To address these limitations, we propose EmoMix, which can generate emotional speech with a specified intensity or a mixture of emotions. Specifically, EmoMix is a controllable emotional TTS model based on a diffusion probabilistic model and a pre-trained speech emotion recognition (SER) model used to extract emotion embeddings. Mixed-emotion synthesis is achieved by combining the noises predicted by the diffusion model conditioned on different emotions during a single sampling process at run time. We further mix the Neutral emotion with a specific primary emotion in varying proportions to control intensity. Experimental results validate the effectiveness of EmoMix for synthesizing mixed emotions and controlling intensity. Comment: Accepted by the 24th Annual Conference of the International Speech Communication Association (INTERSPEECH 2023).
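    A minimal sketch of the noise-mixing idea described above: during one reverse-diffusion sampling step, the noise predictions of a denoiser conditioned on two different emotion embeddings are linearly combined. The function name, signature, and mixing weight are assumptions for illustration, not the released EmoMix interface.

```python
import torch

def mixed_noise(eps_model, x_t, t, emb_a, emb_b, weight_a=0.6):
    """Illustrative mixing of two emotion-conditioned noise predictions.

    eps_model: a hypothetical denoiser eps(x_t, t, emotion_embedding).
    emb_a, emb_b: emotion embeddings (e.g. from a pre-trained SER model).
    weight_a: mixing weight; sweeping it between 0 and 1 with emb_a set
              to the Neutral embedding would correspond to the
              intensity-control case described in the abstract.
    """
    eps_a = eps_model(x_t, t, emb_a)
    eps_b = eps_model(x_t, t, emb_b)
    # Combine the two predictions inside one sampling process.
    return weight_a * eps_a + (1.0 - weight_a) * eps_b
```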

    Improving Music Genre Classification from multi-modal properties of music and genre correlations Perspective

    Music genre classification has been widely studied in the past few years for its various applications in music information retrieval. Previous works tend to perform unsatisfactorily, since they use only audio content or combine audio and lyrics content inefficiently. In addition, as genres normally co-occur in a music track, it is desirable to capture and model genre correlations to improve the performance of multi-label music genre classification. To address these issues, we present a novel multi-modal method that leverages an audio-lyrics contrastive loss and two symmetric cross-modal attention modules to align and fuse features from audio and lyrics. Furthermore, based on the nature of multi-label classification, a genre correlation extraction module is presented to capture and model potential genre correlations. Extensive experiments demonstrate that our proposed method significantly surpasses other multi-label music genre classification methods and achieves state-of-the-art results on the Music4All dataset. Comment: Accepted by ICASSP 202
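    A hedged sketch of a symmetric audio-lyrics contrastive objective of the kind mentioned above (CLIP-style InfoNCE over paired audio and lyrics embeddings); the normalization, temperature, and batch construction are assumptions rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def audio_lyrics_contrastive_loss(audio_emb, lyrics_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss aligning audio and lyrics embeddings.

    audio_emb, lyrics_emb: (batch, dim) embeddings of paired tracks;
    each track's lyrics embedding is the positive for its audio embedding,
    and every other track in the batch is a negative.
    """
    aud = F.normalize(audio_emb, dim=-1)
    lyr = F.normalize(lyrics_emb, dim=-1)
    logits = aud @ lyr.t() / temperature                 # pairwise similarities
    targets = torch.arange(aud.size(0), device=aud.device)
    # Average the audio-to-lyrics and lyrics-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```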

    CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation

    Better disentanglement of speech representations is essential to improve the quality of voice conversion. Recently, contrastive learning based on speaker labels has been applied successfully to voice conversion. However, model performance degrades when converting between similar speakers. Hence, we propose augmented negative sample selection to address this issue. Specifically, we create hard negative samples with the proposed speaker fusion module to improve the learning ability of the speaker encoder. Furthermore, considering fine-grained modeling of speaker style, we employ a reference encoder to extract fine-grained style and conduct augmented contrastive learning on the global style. Experimental results show that the proposed method outperforms previous work on voice conversion tasks. Comment: Accepted by the 21st IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA 2023).
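    One plausible reading of the hard-negative construction, sketched below under assumptions: the speaker fusion module is approximated by linearly interpolating the anchor speaker's embedding with another speaker's, and the fused embedding is used as an extra negative in an InfoNCE-style loss. The names, mixing weight, and loss form are illustrative, not the paper's modules.

```python
import torch
import torch.nn.functional as F

def fused_hard_negative(anchor_spk, other_spk, alpha=0.5):
    """Create a hard negative by fusing two speaker embeddings.

    Linearly interpolating the anchor speaker's embedding with another
    speaker's yields a negative that is close to, but distinct from,
    the anchor; alpha is an assumed interpolation weight.
    """
    return F.normalize(alpha * anchor_spk + (1.0 - alpha) * other_spk, dim=-1)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE over one positive and a set of (hard) negatives.

    anchor, positive: (B, D); negatives: (N, D), including fused negatives.
    """
    pos = (anchor * positive).sum(-1, keepdim=True) / temperature   # (B, 1)
    neg = anchor @ negatives.t() / temperature                      # (B, N)
    logits = torch.cat([pos, neg], dim=-1)
    targets = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, targets)
```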

    DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks

    Generating realistic talking faces is a complex and widely discussed task with numerous applications. In this paper, we present DiffTalker, a novel model designed to generate lifelike talking faces through audio and landmark co-driving. DiffTalker addresses the challenges of directly applying diffusion models, which are traditionally trained on text-image pairs, to audio control. DiffTalker consists of two agent networks: a transformer-based landmark completion network for geometric accuracy and a diffusion-based face generation network for texture detail. Landmarks play a pivotal role in establishing a seamless connection between the audio and image domains, facilitating the incorporation of knowledge from pre-trained diffusion models. This approach efficiently produces clearly articulated talking faces. Experimental results showcase DiffTalker's superior performance in producing clear and geometrically accurate talking faces, all without the need for additional alignment between audio and image features. Comment: Submitted to ICASSP 202
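    A structural sketch of the two-stage, co-driven pipeline described above: a transformer completes facial landmarks from audio features plus partial landmarks, and a diffusion-based generator (represented here by a placeholder) renders the face conditioned on those landmarks. All module choices, dimensions, and interfaces are assumptions, not the DiffTalker implementation.

```python
import torch
import torch.nn as nn

class TalkingFacePipeline(nn.Module):
    """Two stages mirroring the abstract: landmark completion, then
    landmark-conditioned face generation."""

    def __init__(self, audio_dim=80, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.lmk_proj = nn.Linear(2, hidden)
        self.lmk_completion = nn.Transformer(d_model=hidden, batch_first=True)
        self.lmk_head = nn.Linear(hidden, 2)
        # Placeholder standing in for a conditional diffusion renderer.
        self.face_diffusion = nn.Identity()

    def forward(self, audio_feats, partial_lmks):
        # audio_feats: (B, T, audio_dim); partial_lmks: (B, N, 2)
        src = self.audio_proj(audio_feats)
        tgt = self.lmk_proj(partial_lmks)
        completed = self.lmk_head(self.lmk_completion(src, tgt))   # (B, N, 2)
        frame = self.face_diffusion(completed)   # landmark-conditioned rendering
        return completed, frame
```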

    QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis

    Recent expressive text-to-speech (TTS) models focus on synthesizing emotional speech, but some fine-grained styles, such as intonation, are neglected. In this paper, we propose QI-TTS, which aims to better transfer and control intonation so as to convey the speaker's questioning intention while transferring emotion from the reference speech. We propose a multi-style extractor that extracts style embeddings at two different levels: the sentence level represents emotion, while the final-syllable level represents intonation. For fine-grained intonation control, we use relative attributes to represent intonation intensity at the syllable level. Experiments have validated the effectiveness of QI-TTS for improving intonation expressiveness in emotional speech synthesis. Comment: Accepted by ICASSP 202
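    Relative attributes are typically learned as a ranking function whose score orders samples by attribute strength. The sketch below assumes a linear ranker trained with a pairwise margin loss and a scalar that scales the final-syllable intonation embedding at synthesis time; neither detail is specified by the abstract.

```python
import torch
import torch.nn.functional as F

class IntonationRanker(torch.nn.Module):
    """Linear ranking function in the spirit of relative attributes."""

    def __init__(self, dim):
        super().__init__()
        self.w = torch.nn.Linear(dim, 1, bias=False)

    def forward(self, emb):
        # Higher score means stronger (more questioning) intonation.
        return self.w(emb).squeeze(-1)

def pairwise_ranking_loss(score_strong, score_weak, margin=1.0):
    # Encourage the sample with stronger intonation to score higher by a margin.
    return F.relu(margin - (score_strong - score_weak)).mean()

def scale_intonation(syllable_emb, intensity):
    # At synthesis time, a scalar in [0, 1] scales the final-syllable
    # intonation embedding; this control scheme is an assumption.
    return intensity * syllable_emb
```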

    FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework

    This paper integrates graph-to-sequence modelling into an end-to-end text-to-speech framework for syntax-aware modelling of the syntactic information in the input text. Specifically, the input text is parsed by a dependency parsing module to form a syntactic graph. The syntactic graph is then encoded by a graph encoder to extract syntactic hidden information, which is concatenated with the phoneme embedding and fed to the alignment and flow-based decoding modules to generate the raw audio waveform. The model is evaluated on two languages, English and Mandarin, using single-speaker, few-shot target-speaker, and multi-speaker datasets, respectively. Experimental results show better prosodic consistency between the input text and the generated audio, higher scores in subjective prosodic evaluation, and the ability to perform voice conversion. In addition, the efficiency of the model is greatly boosted through the design of an AI-chip operator, giving a 5x acceleration. Comment: Accepted by the 35th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2023).
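    A minimal sketch of how a dependency-graph encoding could be concatenated with phoneme embeddings, as described above. The single round of mean aggregation, the word-to-phoneme alignment tensor, and all names are illustrative assumptions, not the FastGraphTTS graph encoder.

```python
import torch
import torch.nn as nn

class SyntaxGraphEncoder(nn.Module):
    """One round of mean aggregation over a dependency adjacency matrix,
    followed by a linear projection (illustrative only)."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, word_emb, adj):
        # word_emb: (B, W, dim); adj: (B, W, W) symmetrized 0/1 dependency edges
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        neighbor_mean = adj @ word_emb / deg
        return self.proj(torch.cat([word_emb, neighbor_mean], dim=-1))

def fuse_with_phonemes(syntax_h, phoneme_emb, word_to_phoneme):
    # word_to_phoneme: (B, P) index of the word each phoneme belongs to
    # (an assumed alignment). Gather each phoneme's word-level syntactic
    # state and concatenate it with the phoneme embedding.
    idx = word_to_phoneme.unsqueeze(-1).expand(-1, -1, syntax_h.size(-1))
    per_phoneme_syntax = torch.gather(syntax_h, 1, idx)
    return torch.cat([phoneme_emb, per_phoneme_syntax], dim=-1)
```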

    PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

    Voice conversion, the style-transfer task applied to speech, refers to converting one person's speech into new speech that sounds like another person's. Much research has been devoted to better implementations of VC. However, a good voice conversion model should match not only the timbre of the target speaker but also expressive information such as prosody, pace, and pauses. In this context, prosody modeling is crucial for achieving expressive voice conversion that sounds natural and convincing. Unfortunately, prosody modeling is important but challenging, especially without text transcriptions. In this paper, we propose a novel voice conversion framework named PMVC, which effectively separates and models the content, timbre, and prosodic information of speech without text transcriptions. Specifically, we introduce a new speech augmentation algorithm for robust prosody extraction, and building upon this, a mask-and-predict mechanism is applied to disentangle prosody and content information. Experimental results on the AIShell-3 corpus support the improved naturalness and similarity of the converted speech. Comment: Accepted by the 31st ACM International Conference on Multimedia (MM 2023).
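    A hedged sketch of a mask-and-predict disentanglement objective consistent with the description above: the content encoder sees time-masked mel frames, and a predictor must reconstruct the masked frames from the content code plus a prosody code extracted from the unmasked signal. The encoders, predictor, and masking scheme are assumptions, not the PMVC modules.

```python
import torch
import torch.nn.functional as F

def random_time_mask(feats, mask_ratio=0.3):
    """Randomly zero out frames of a (B, T, D) feature sequence."""
    mask = torch.rand(feats.shape[:2], device=feats.device) < mask_ratio
    return feats.masked_fill(mask.unsqueeze(-1), 0.0), mask

def mask_and_predict_loss(content_enc, prosody_enc, predictor, mel):
    """Reconstruct masked mel frames from content + prosody codes.

    content_enc, prosody_enc, predictor: hypothetical modules mapping
    (B, T, n_mels) to frame-level codes and back to mel frames.
    """
    masked_mel, mask = random_time_mask(mel)
    content = content_enc(masked_mel)          # (B, T, Dc), sees masked input
    prosody = prosody_enc(mel)                 # (B, T, Dp), sees full input
    pred = predictor(torch.cat([content, prosody], dim=-1))   # (B, T, n_mels)
    # Loss only on the masked frames, so content alone cannot solve the task.
    return F.l1_loss(pred[mask], mel[mask])
```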